BTS 510 Lab 11

set.seed(12345)
library(tidyverse)
Warning: package 'purrr' was built under R version 4.5.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Stat2Data)
theme_set(theme_classic(base_size = 16))

1 Learning objectives

  • Conduct logistic regression for binary outcomes
  • Interpret results in terms of probability, odds, and log-odds

2 Data

  • ICU data from the Stat2Data package: n = 200
    • ID: Patient ID code
    • Survive: 1 = patient survived to discharge or 0 = patient died
    • Age: Age (in years)
    • AgeGroup: 1 = young (under 50), 2 = middle (50-69), 3 = old (70+)
    • Sex: 1 = female or 0 = male
    • Infection: 1 = infection suspected or 0 = no infection
    • SysBP: Systolic blood pressure (in mm of Hg)
    • Pulse: Heart rate (beats per minute)
    • Emergency: 1 = emergency admission or 0 = elective admission

3 Analysis

  • Replicate and extend the analysis from the lecture
    • Age and Emergency predict Survive
    • Be sure to mean center Age for interpretability

4 Pseudo R^2

  • For linear regression, R^2_{multiple} is calculated from the sums of squares from ordinary least squares estimation
  • Logistic regression is estimated via maximum likelihood
    • There are no sums of squares
    • There is, however, deviance
  • Deviance is a function of the log-likelihood
    • Deviance for a model is how far the model is from a theoretical perfect model where every observation is perfectly predicted
    • Relative measure so can only compare to another model
    • Big deviance (far from perfect) is worse than small deviance (closer to perfect)
  • Pseudo R^2
    • Compare your model to a model with no predictors (only intercept)
    • How much closer is this model to the perfect model than the intercept only model?
    • Theoretically ranges from 0 to 1 (but not in practice)
    • R^2_{deviance} = 1 - \frac{deviance_{M}}{deviance_{intercept\ only}}
    • Deviance for your model is “Residual deviance” from output
    • Deviance for intercept model is “Null deviance” from output

5 Confidence intervals

  • Calculate confidence intervals for all coefficients in a model using confint()
    • For example: confint(m1) to get confidence intervals for model m1
    • If the CI doesn’t include zero, the coefficient is significant
    • Default is 99% confidence interval
      • Use level argument to change
      • e.g., confint(m1, level = 0.99) produces 99% CIs
  • To convert from default logit metric to odds metric: Exponentiate
    • Logit metric: b_1
    • Convert to odds ratio: e^{b_1}
    • Do the same thing to the confidence intervals
      • exp(confint(m1)) produces the odds metric CIs
      • Compare to “no effect” value of one for odds
      • If the CI doesn’t include one, the coefficient is significant

6 Predicted values

  • Two main options
    • Use R like a calculator to calculate predicted values using equations
      • You can be fancy and grab the coefficients directly from the output
      • e.g., using coef(m1)[1] to get the first coefficient, etc.
      • Or just type the numbers in
    • Use predict() to calculate predicted values
      • Set up a new dataset with the predictor values you’re interested in
      • e.g., age_vals <- data.frame(AgeC = c(-20, 0, 20))
      • Then use that new data in predict()
      • e.g., predict(m1, newdata = age_vals, type = "response")
      • type = "response" gives predicted values in probability
      • type = "link" gives predicted values in logit

7 Tasks

  • Model 1: Survive ~ Age
  • Model 2: Survive ~ Age + Emergency
library(Stat2Data)
data(ICU)
ICU <- ICU %>% mutate(AgeC = Age - mean(Age, na.rm = TRUE))
m1 <- glm(data = ICU, 
          Survive ~ AgeC, 
          family = binomial(link = "logit"))
m2 <- glm(data = ICU, 
          Survive ~ AgeC + Emergency, 
          family = binomial(link = "logit"))
summary(m1)

Call:
glm(formula = Survive ~ AgeC, family = binomial(link = "logit"), 
    data = ICU)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.47357    0.19133   7.702 1.34e-14 ***
AgeC        -0.02754    0.01056  -2.607  0.00913 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.16  on 199  degrees of freedom
Residual deviance: 192.31  on 198  degrees of freedom
AIC: 196.31

Number of Fisher Scoring iterations: 4
summary(m2)

Call:
glm(formula = Survive ~ AgeC + Emergency, family = binomial(link = "logit"), 
    data = ICU)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.55130    0.73325   4.843 1.28e-06 ***
AgeC        -0.03402    0.01069  -3.181  0.00147 ** 
Emergency   -2.45354    0.75257  -3.260  0.00111 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.16  on 199  degrees of freedom
Residual deviance: 173.08  on 197  degrees of freedom
AIC: 179.08

Number of Fisher Scoring iterations: 6
  1. Conduct a likelihood ratio test to compare the models. Report the results. Which model is preferred?
anova(m1, m2, test = "LRT")
Analysis of Deviance Table

Model 1: Survive ~ AgeC
Model 2: Survive ~ AgeC + Emergency
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       198     192.31                          
2       197     173.08  1   19.231 1.158e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Model 2 (with both AgeC and Emergency predicting Survive) is better than model 1 (with only AgeC predicting Survive), \chi^2(1) = 19.231, p<.0001.
  1. Report the full results of the preferred model. This is similar to linear regression. Report the R^2 for the model. For each coefficient (including intercept), report:
  • Regression coefficient
  • Standard error
  • z statistic (note that there aren’t degrees of freedom for z tests)
  • p value
summary(m2)

Call:
glm(formula = Survive ~ AgeC + Emergency, family = binomial(link = "logit"), 
    data = ICU)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.55130    0.73325   4.843 1.28e-06 ***
AgeC        -0.03402    0.01069  -3.181  0.00147 ** 
Emergency   -2.45354    0.75257  -3.260  0.00111 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.16  on 199  degrees of freedom
Residual deviance: 173.08  on 197  degrees of freedom
AIC: 179.08

Number of Fisher Scoring iterations: 6
  • R^2_{deviance} = 1 - \frac{deviance_{model}}{deviance_{null}} = 1 - \frac{173.08}{200.16} = 0.135
    • The two predictors explain 13.5% of the variance in the outcome.
    • (For comparison, R^2_{deviance} for m1 is 0.039)
  • The intercept is significantly different from 0, b_0 = 3.55, s.e. = 0.73, z = 4.84, p<.0001.
    • The logit for a person at mean Age and non-Emergency admission is 3.55. This is greater than 0, indicating that a person of mean age and non-emergency admission has a greater than 0.5 probability of survival.
  • The slope for AgeC is significantly different from 0, b_1 = -0.03, s.e. = 0.01, z = -3.18, p<.01.
    • Likelihood of survival decreases as AgeC increases.
  • The slope for Emergency is significantly different from 0, b_2 = -2.45, s.e. = 0.75, z = -3.26, p<.01.
    • Likelihood of survival decreases as Emergency increases; emergency admissions (Emergency = 1) are less likely to survive than non-emergency admissions (Emergency = 0)
  1. Calculate predicted probabilities for 6 predictor combinations:
  • Emergency admit, 20 years younger than mean
  • Emergency admit, mean age
  • Emergency admit, 20 years older than mean
  • Non-emergency admit, 20 years younger than mean
  • Non-emergency admit, mean age
  • Non-emergency admit, 20 years older than mean
AgeC <- c(-20, 0, 20, -20, 0, 20)
Emergency <- c(0, 0, 0, 1, 1, 1)
vals <- data.frame(AgeC, Emergency)
predict(m2, newdata = vals, type = "response")
        1         2         3         4         5         6 
0.9856793 0.9721127 0.9463930 0.8554609 0.7498415 0.6028714 
  • Plot to help visualize this
ggplot(data = ICU,
       aes(x = AgeC, 
           y = Survive, 
           group = factor(Emergency), 
           color = factor(Emergency))) +
  geom_point(size = 3, alpha = 0.3) +
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE,
              fullrange = T) +
  xlim(-50, 50) + 
  theme(legend.position="none") +
  geom_vline(xintercept = -20, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 20, linetype = "dashed")
`geom_smooth()` using formula = 'y ~ x'

  1. Report the odds ratios for each predictor and their confidence intervals. Interpret the odds ratios in words.
confint(m2)
Waiting for profiling to be done...
                  2.5 %      97.5 %
(Intercept)  2.35176991  5.37753834
AgeC        -0.05635795 -0.01413018
Emergency   -4.30286135 -1.19835339
exp(coef(m2))
(Intercept)        AgeC   Emergency 
34.85867876  0.96655588  0.08598906 
exp(confint(m2))
Waiting for profiling to be done...
                  2.5 %      97.5 %
(Intercept) 10.50414466 216.4886964
AgeC         0.94520074   0.9859692
Emergency    0.01352979   0.3016906
  • Odds ratio for AgeC is 0.967 with 95% confidence interval [0.945, 0.986].
    • The odds ratio is less than 1, indicating that odds of survival decreases as age increases
    • The confidence interval does not include one so this is a significant effect
  • Odds ratio for Emergency is 0.086 with 95% confidence interval [0.014, 0.302].
    • The odds ratio is less than 1, indicating that odds of survival decreases for emergency admission compared to non-emergency admission
    • The confidence interval does not include one so this is a significant effect
  1. Describe the overall findings for this model. Be statistically accurate but avoid jargon and technical terms as much as you can. Be sure to use the names of the variables studied (i.e., survival, age, emergency admission) rather than X and Y.
  • Both older age and emergency admission negatively impact survival
    • Older patients are less likely to survive than younger patients
    • Emergency admission patients are less likely to survive than non-emergency admission patients
  • In terms of probability, the gap between emergency and non-emergency admission patients increases with age
    • For younger patients (20 years below the mean age), there is a 13 percentage point gap (0.986 vs 0.855)
    • For older patients (20 years above the mean age), there is a 34 percentage point gap (0.946 vs 0.603)
    • (I just added this because it demonstrates how you get an “interaction” without an interaction in these non-linear models)
  • There are probably other things you could say about this